RabbitMQ is an open source message broker, useful in almost any technology stack. It's built in Erlang and has a reputation for being a robust and reliable technology. However, in this article we'll look at what happens when things are not going smoothly, and how to handle retries when processing RabbitMQ messages.

The Scenario

Being defensive when developing applications is important: if an error occurs, our application should detect it and react accordingly. With RabbitMQ, an application places messages onto a queue, and these messages are then processed by scripts called workers. There might be many workers running at one time. If a message can’t be processed, we might want to retry it, in case the failure was caused by a temporary glitch that doesn’t recur. In other cases, more advanced logic may be needed, such as allowing only a certain number of retries. Let us explore the options …

Unacknowledged Messages

By default, RabbitMQ expects each message to be “acknowledged” once processing has completed successfully. If a message has been delivered to a worker but the worker’s connection or channel closes before it sends that acknowledgement (for example, because the worker crashed), RabbitMQ puts the message back on the queue and delivers it again, typically to another worker. Recent RabbitMQ versions can also enforce a configurable acknowledgement timeout, closing the channel of a worker that holds a message for too long so that the message is redelivered. In effect, this gives you retry logic for the situation where your worker crashes while trying to process a message.
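To make the redelivery behaviour concrete, here is a toy, in-memory model in Python. It is not RabbitMQ’s API: `ToyQueue`, `deliver`, `ack`, `worker_died` and the `flaky` handler are hypothetical names chosen for illustration. A message handed to a worker stays unacknowledged until the worker acks it; if the worker fails first, the message goes back on the queue flagged as redelivered.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Delivery:
    body: str
    redelivered: bool = False  # RabbitMQ sets a similar flag on redelivered messages


class ToyQueue:
    """In-memory stand-in for a queue with manual acknowledgements (not a RabbitMQ API)."""

    def __init__(self, bodies):
        self.ready = deque(Delivery(b) for b in bodies)
        self.unacked = {}  # delivery tag -> message handed out but not yet acknowledged
        self._next_tag = 1

    def deliver(self):
        msg = self.ready.popleft()
        tag = self._next_tag
        self._next_tag += 1
        self.unacked[tag] = msg
        return tag, msg

    def ack(self, tag):
        del self.unacked[tag]

    def worker_died(self):
        # Anything the failed worker had not acknowledged goes back on the queue,
        # in its original order, flagged as redelivered.
        for tag in sorted(self.unacked, reverse=True):
            msg = self.unacked.pop(tag)
            msg.redelivered = True
            self.ready.appendleft(msg)


def run(queue, handler):
    processed = []
    while queue.ready:
        tag, msg = queue.deliver()
        try:
            handler(msg)
        except Exception:
            queue.worker_died()  # simulate a crash: no ack was ever sent
            continue
        queue.ack(tag)
        processed.append((msg.body, msg.redelivered))
    return processed


def flaky(msg):
    # Hypothetical handler: "b" hits a temporary glitch on its first delivery only.
    if msg.body == "b" and not msg.redelivered:
        raise RuntimeError("temporary glitch")


print(run(ToyQueue(["a", "b", "c"]), flaky))
# [('a', False), ('b', True), ('c', False)]
```

Because the second delivery of `b` succeeds, the temporary glitch is absorbed without any explicit retry code in the worker.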

Reject and Requeue

A better option is for the worker to detect that it cannot process the message. In this case, the worker can send a “reject” response, indicating that the message was not processed, so there is no need to wait for RabbitMQ to notice that the worker has failed. The worker can also ask for the message to be requeued, in which case RabbitMQ returns it to its original position in the queue (usually at or near the front) and a worker, possibly the same one, will try to process it again.
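Here is a sketch of the decision a worker might make, assuming it can tell temporary failures apart from permanent ones. `Outcome`, `TemporaryError`, `handle` and `call_api` are hypothetical names. With a client library such as pika, the three outcomes would correspond roughly to `channel.basic_ack(delivery_tag=...)`, `channel.basic_reject(delivery_tag=..., requeue=True)` and `channel.basic_reject(delivery_tag=..., requeue=False)`.

```python
from enum import Enum


class Outcome(Enum):
    ACK = "ack"                 # processed: remove it from the queue
    REJECT_REQUEUE = "requeue"  # failed, might work next time: put it back
    REJECT_DROP = "drop"        # failed for good: reject without requeueing


class TemporaryError(Exception):
    """Hypothetical: a failure that may not recur, such as a timed-out API call."""


def handle(body, process):
    """Run one message through `process` and decide how to answer the broker."""
    try:
        process(body)
    except TemporaryError:
        return Outcome.REJECT_REQUEUE
    except Exception:
        return Outcome.REJECT_DROP
    return Outcome.ACK


def call_api(body):
    # Hypothetical processing step that hits a transient problem.
    raise TemporaryError("upstream timed out")


print(handle("order-42", call_api).name)  # REJECT_REQUEUE
```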

Beware Poison Messages

In both of the situations we have looked at so far, a message will keep being requeued regardless of whether it can _ever_ be processed. Messages that fail every time are known as “poison messages”, and they cause real problems: they waste resources by occupying workers over and over, and can even render the system unusable. With this in mind, it can be better to allow only a restricted number of retries.
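A simplified, single-worker simulation (hypothetical names, no real broker) shows why requeueing alone is risky: a message that always fails goes straight back to the front of the queue, so the healthy message behind it never gets a turn. The `max_deliveries` cap exists only so the demo terminates.

```python
from collections import deque


def consume_with_requeue(bodies, can_process, max_deliveries):
    """Toy model: every failure is rejected with requeue, putting it back at the front.

    `max_deliveries` only stops the demo; a real queue would keep going forever.
    """
    ready = deque(bodies)
    log = []
    for _ in range(max_deliveries):
        if not ready:
            break
        body = ready.popleft()
        if can_process(body):
            log.append(("ack", body))
        else:
            log.append(("reject+requeue", body))
            ready.appendleft(body)  # straight back to the front of the queue
    return log


log = consume_with_requeue(["ok-1", "poison", "ok-2"], lambda b: b != "poison", max_deliveries=5)
print(log.count(("reject+requeue", "poison")))  # 4
print(any(body == "ok-2" for _, body in log))    # False
```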

Counting Retries

RabbitMQ does not have a built-in way to handle a set number of retries, but we can add a little metadata to our message to track this. Since messages are immutable, this involves creating a new message which looks a lot like the old one. The process looks something like this:

1. Try to process the message; it fails.
2. Check whether we’ve already retried this message by looking for a header (you can call it anything you like, but I usually use `X-Retries`).
3. If we have reached the maximum number of retries, reject the message without requeueing it.
4. If not, create a new message identical to the previous one but with `X-Retries` incremented (or created and set to 1 if it wasn’t already there), and publish this new message to the queue.
5. Send an acknowledgement for the original message to remove it from the queue.
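The steps above can be sketched as a pure function that decides which broker operations to perform, which keeps the retry policy easy to test. This is an illustration, not a library API: `Message`, `on_failure`, the action names and the limit of three retries are assumptions. With pika, `"publish"` would map roughly onto `channel.basic_publish(...)` with `properties=pika.BasicProperties(headers=...)`, `"ack"` onto `channel.basic_ack(...)` and `"reject_no_requeue"` onto `channel.basic_reject(..., requeue=False)`.

```python
from dataclasses import dataclass, field

RETRY_HEADER = "X-Retries"  # the header name suggested above; any name works
MAX_RETRIES = 3             # assumption: choose a limit that suits your workload


@dataclass
class Message:
    """Minimal stand-in for a delivered message: a body plus its headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


def on_failure(msg, max_retries=MAX_RETRIES):
    """Decide what to do with a message whose processing just failed.

    Returns the broker operations to perform, in order, as (action, message) pairs.
    """
    retries = msg.headers.get(RETRY_HEADER, 0)
    if retries >= max_retries:
        # Out of retries: reject without requeueing so it leaves the queue.
        return [("reject_no_requeue", msg)]
    # A delivered message can't be changed, so build a copy with the count bumped;
    # any other headers are carried over unchanged.
    retry = Message(msg.body, {**msg.headers, RETRY_HEADER: retries + 1})
    # Publish the copy first, then ack the original.
    return [("publish", retry), ("ack", msg)]


first = Message(b"resize image 17")
ops = on_failure(first)
print([action for action, _ in ops])  # ['publish', 'ack']
print(ops[0][1].headers)              # {'X-Retries': 1}
print(on_failure(Message(b"resize image 17", {"X-Retries": 3}))[0][0])  # reject_no_requeue
```

Publishing the copy before acknowledging the original means that a crash between the two steps leaves, at worst, a duplicate message rather than a lost one.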

A benefit of this approach is that the new message is added to the back of the queue, so other messages are processed before it. When failing messages are mixed in with healthy ones, the healthy messages keep flowing instead of waiting behind repeated retries. At busy times, when messages spend longer in the queue, those few extra seconds can sometimes give a struggling system enough time to recover for the retry to succeed. This is not very scientific, but it is one factor that may influence your choice of retry strategy for your own applications.
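A toy simulation with hypothetical names and no real broker illustrates the ordering effect: each failed message is republished at the back of the queue with its retry count incremented, so healthy messages are processed before the retry, and a message that never succeeds is dropped once it runs out of retries.

```python
from collections import deque


def drain(bodies, can_process, max_retries=2):
    """Toy queue: a failed message is republished at the back with its retry count
    incremented (and the original acked), until it runs out of retries."""
    ready = deque((body, 0) for body in bodies)
    processed, dead = [], []
    while ready:
        body, retries = ready.popleft()
        if can_process(body, retries):
            processed.append(body)
        elif retries >= max_retries:
            dead.append(body)                  # reject without requeue
        else:
            ready.append((body, retries + 1))  # new copy joins the back of the queue
    return processed, dead


def can_process(body, retries):
    # Hypothetical workload: "flaky" succeeds on its first retry, "poison" never does.
    if body == "poison":
        return False
    return body != "flaky" or retries >= 1


print(drain(["a", "flaky", "b", "poison", "c"], can_process))
# (['a', 'b', 'c', 'flaky'], ['poison'])
```

Here `b` and `c` complete before `flaky` gets its second attempt, and `poison` is rejected after two retries instead of circulating forever.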

Learn more about this topic in my session “Queues with RabbitMQ” at the International PHP Conference (October 23–27, 2017 in Munich).